10 research outputs found

    Population variability in the generation and thymic selection of T-cell repertoires

    The diversity of T-cell receptor (TCR) repertoires is achieved by a combination of two intrinsically stochastic steps: random receptor generation by VDJ recombination, and selection based on the recognition of random self-peptides presented on the major histocompatibility complex. These processes lead to a large receptor variability within and between individuals. However, the characterization of the variability is hampered by the limited size of the sampled repertoires. We introduce a new software tool, SONIA, to facilitate inference of individual-specific computational models for the generation and selection of the TCR beta chain (TRB) from sequenced repertoires of 651 individuals, separating and quantifying the variability of the two processes of generation and selection in the population. We find not only that most of the variability is driven by the VDJ generation process, but also that there is a large degree of consistency between individuals, with the inter-individual variance of repertoires being about 2% of the intra-individual variance. Known viral-specific TCRs follow the same generation and selection statistics as all TCRs. Comment: 13 pages, 7 figures, 2 tables

    On generative models of T-cell receptor sequences

    T-cell receptors (TCRs) are key proteins of the adaptive immune system, generated randomly in each individual, whose diversity underlies our ability to recognize infections and malignancies. Modeling the distribution of TCR sequences is of key importance for immunology and medical applications. Here, we compare two inference methods trained on high-throughput sequencing data: a knowledge-guided approach, which accounts for the details of sequence generation, supplemented by a physics-inspired model of selection; and a knowledge-free Variational Auto-Encoder based on deep artificial neural networks. We show that the knowledge-guided model outperforms the deep network approach at predicting TCR probabilities, while being more interpretable, at a lower computational cost.

    OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs

    Motivation: High-throughput sequencing of large immune repertoires has enabled the development of methods to predict the probability of generation by V(D)J recombination of T- and B-cell receptors of any specific nucleotide sequence. These generation probabilities are very non-homogeneous, ranging over 20 orders of magnitude in real repertoires. Since the function of a receptor really depends on its protein sequence, it is important to be able to predict this probability of generation at the amino acid level. However, brute-force summation over all the nucleotide sequences with the correct amino acid translation is computationally intractable. The purpose of this paper is to present a solution to this problem. Results: We use dynamic programming to construct an efficient and flexible algorithm, called OLGA (Optimized Likelihood estimate of immunoGlobulin Amino-acid sequences), for calculating the probability of generating a given CDR3 amino acid sequence or motif, with or without V/J restriction, as a result of V(D)J recombination in B or T cells. We apply it to databases of epitope-specific T-cell receptors to evaluate the probability that a typical human subject will possess T cells responsive to specific disease-associated epitopes. The model prediction shows an excellent agreement with published data. We suggest that OLGA may be a useful tool to guide vaccine design. Availability: Source code is available at https://github.com/zsethna/OLG
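
The contrast drawn above between brute-force summation and dynamic programming can be illustrated with a toy sketch. It is not OLGA's algorithm or API: it assumes a hypothetical first-order Markov model over nucleotides (the `START` and `TRANS` values are made up for illustration) in place of a real V(D)J recombination model, and sums the probability of every nucleotide sequence that translates to a given amino acid sequence.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}
for i, triple in enumerate(product(BASES, repeat=3)):
    CODONS.setdefault(AMINO[i], []).append("".join(triple))

# Hypothetical toy model (an assumption, not inferred from data):
# uniform first base, and a Markov chain that favours repeating the last base.
START = {b: 0.25 for b in BASES}
TRANS = {a: {b: (0.4 if a == b else 0.2) for b in BASES} for a in BASES}


def seq_prob(nt):
    """Probability of one nucleotide sequence under the toy Markov model."""
    p = START[nt[0]]
    for a, b in zip(nt, nt[1:]):
        p *= TRANS[a][b]
    return p


def brute_force(aa):
    """Enumerate every codon combination; cost grows exponentially with len(aa)."""
    return sum(seq_prob("".join(cs)) for cs in product(*(CODONS[x] for x in aa)))


def dynamic_programming(aa):
    """Same sum, computed position by position (aa must be non-empty).

    state[b] holds the total probability of all nucleotide prefixes that
    translate to the amino acids seen so far and end in base b.
    """
    state = None
    for x in aa:
        new = {b: 0.0 for b in BASES}
        for codon in CODONS[x]:
            if state is None:
                p = START[codon[0]]
            else:
                p = sum(state[b] * TRANS[b][codon[0]] for b in BASES)
            p *= TRANS[codon[0]][codon[1]] * TRANS[codon[1]][codon[2]]
            new[codon[2]] += p
        state = new
    return sum(state.values())
```

Both functions return the same value for a short sequence such as `"CASSL"`, but the dynamic-programming version does work proportional to the sequence length, whereas the brute-force version multiplies the number of synonymous codons at every position.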

    Inferring processes underlying B-cell repertoire diversity

    We quantify the VDJ recombination and somatic hypermutation processes in human B-cells using probabilistic inference methods on high-throughput DNA sequence repertoires of human B-cell receptor heavy chains. Our analysis captures the statistical properties of the naive repertoire, first after its initial generation via VDJ recombination and then after selection for functionality. We also infer statistical properties of the somatic hypermutation machinery (exclusive of subsequent effects of selection). Our main results are the following: the B-cell repertoire is substantially more diverse than T-cell repertoires, due to longer junctional insertions; sequences that pass initial selection are distinguished by having a higher probability of being generated in a VDJ recombination event; somatic hypermutations have a non-uniform distribution along the V gene that is well explained by an independent site model for the sequence context around the hypermutation site. Comment: acknowledgement added

    Probability, Entropy, and Adaptive Immune System Repertoires

    The adaptive immune system, composed of white blood cells called lymphocytes (B and T cells) that circulate in the lymph and blood, is a precision tool that tags and removes foreign peptides. Such peptides, also called antigens or epitopes, are identified by a specific binding to elements of a library or repertoire of unique proteins called receptors (e.g. antibodies or T cell receptors). A repertoire must be large and diverse enough so that at least one receptor will be able to recognize any pathogen epitope the organism is likely to encounter. This diversity is achieved by stochastic rearrangement of the germline DNA to create novel complementarity determining region sequences (CDR3) in a process called V(D)J recombination. In this thesis we utilize previously developed generative models of V(D)J recombination events, and infer the model parameters from large datasets of DNA sequences. The generation probability (Pgen) of a nucleotide or amino acid CDR3 is the sum of all model probabilities of V(D)J recombination events that generate the sequence. While previously it was only feasible to compute Pgen of nucleotide sequences, we introduce a novel dynamic programming algorithm that efficiently computes Pgen of amino acid sequences. We use this Pgen for several applications. First we examine how the diversity of a repertoire, characterized by the model entropy, scales with the number of insertions in the V(D)J process. This is used to describe the maturation of the T cell repertoire of mice from embryos to young adults. Next, we introduce a statistical model of hypermutation in B cells and infer the parameters from a human repertoire, providing a principled quantification of the biases in hypermutation rates. Lastly, we examine the statistics of the receptors shared amongst a cohort of more than 600 individual humans and show that the statistics and identities of so-called ‘public’ sequences are determined directly from Pgen. We highlight possible clinical applications and attempt to place this work in the context of a full theory of the adaptive immune system.

    Identification of transcriptional programs using dense vector representations defined by mutual information with GeneVector

    Deciphering individual cell phenotypes from cell-specific transcriptional processes requires high dimensional single cell RNA sequencing. However, current dimensionality reduction methods aggregate sparse gene information across cells, without directly measuring the relationships that exist between genes. By performing dimensionality reduction with respect to gene co-expression, low-dimensional features can model these gene-specific relationships and leverage shared signal to overcome sparsity. We describe GeneVector, a scalable framework for dimensionality reduction implemented as a vector space model using mutual information between gene expression. Unlike other methods, including principal component analysis and variational autoencoders, GeneVector uses latent space arithmetic in a lower dimensional gene embedding to identify transcriptional programs and classify cell types. In this work, we show in four single cell RNA-seq datasets that GeneVector was able to capture phenotype-specific pathways, perform batch effect correction, interactively annotate cell types, and identify pathway variation with treatment over time.
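
As a rough illustration of the quantity mentioned above, the following sketch computes the mutual information between two genes from discretized expression values across cells. It is a plug-in estimate written for this listing, not GeneVector's estimator or API; the function name and the example vectors are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Plug-in mutual information (in bits) between two equal-length
    sequences of discrete expression levels, one entry per cell."""
    n = len(x)
    joint = Counter(zip(x, y))   # counts of (level in gene X, level in gene Y)
    px = Counter(x)              # marginal counts for gene X
    py = Counter(y)              # marginal counts for gene Y
    # sum over observed pairs of p(a,b) * log2( p(a,b) / (p(a) p(b)) )
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


# Hypothetical binarized expression (0 = not detected, 1 = detected) in four cells.
gene_a = [0, 0, 1, 1]
gene_b = [0, 0, 1, 1]   # perfectly co-expressed with gene_a
gene_c = [0, 1, 0, 1]   # independent of gene_a in this toy sample

mi_coexpressed = mutual_information(gene_a, gene_b)
mi_independent = mutual_information(gene_a, gene_c)
```

In this toy sample the co-expressed pair carries one bit of mutual information and the independent pair carries none; with real single-cell counts, the choice of discretization is a modelling decision this sketch leaves open.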

    Fundamental immune–oncogenicity trade-offs define driver mutation fitness

    Missense driver mutations in cancer are concentrated in a few hotspots(1). Various mechanisms have been proposed to explain this skew, including biased mutational processes(2), phenotypic differences(3–6) and immunoediting of neoantigens(7,8); however, to our knowledge, no existing model weighs the relative contribution of these features to tumour evolution. We propose a unified theoretical ‘free fitness’ framework that parsimoniously integrates multimodal genomic, epigenetic, transcriptomic and proteomic data into a biophysical model of the rate-limiting processes underlying the fitness advantage conferred on cancer cells by driver gene mutations. Focusing on TP53, the most mutated gene in cancer(1), we present an inference of mutant p53 concentration and demonstrate that TP53 hotspot mutations optimally solve an evolutionary trade-off between oncogenic potential and neoantigen immunogenicity. Our model anticipates patient survival in The Cancer Genome Atlas and in patients with lung cancer treated with immunotherapy, as well as the age of tumour onset in germline carriers of TP53 variants. The predicted differential immunogenicity between hotspot mutations was validated experimentally in patients with cancer and in a unique large dataset of healthy individuals. Our data indicate that immune selective pressure on TP53 mutations has a smaller role in non-cancerous lesions than in tumours, suggesting that targeted immunotherapy may offer an early prophylactic opportunity for the former. Determining the relative contribution of immunogenicity and oncogenic function to the selective advantage of hotspot mutations thus has important implications for both precision immunotherapies and our understanding of tumour evolution.